Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A New importer #1518

Closed
wants to merge 4 commits into from
Closed

A New importer #1518

wants to merge 4 commits into from

Conversation

vodik
Copy link
Contributor

@vodik vodik commented Feb 26, 2018

This PR replace the loader with a more correct implementation, and then introduces some refactoring work.

I've added the necessary bits to provide a Python 3.4+ compliant import machinery (works on Python 3.3 as well). This means we now have our own:

  • A PathFinder, which we register as a sys.meta_path entry
  • A FileFinder which is responsible for finding Hy code in the filesystem
  • And finally our own Loader, which matches the Python implementation but for Hy instead of Python

There's also a compatibility later for Python 2:

  • A Importer to sit in sys.meta_path and properly implements the recursive module loading mechanism imp.find_module expects.
  • A Loader that makes the Python 3 loader compatible with Python 2.

On top of that, functionality from Python 3 is backported to two, like _atomic_write, to improve the correctness of the code that writes bytecode to disk.

With these changes, the following issues are address:

I want to spend a bit more time refactoring hy_compile, ast_compile, hy_parse and hy_eval as I want to expose a Hy API with similar signature as Python's parse, compile, and exec.

But I think its in a reviewable state as it is.

There are still a few outstanding tasks:

  • Fix tests and add more tests
  • Document the loader, some of the gotchas (e.g. Hy code will be interpreted as a namespace module in Python)
  • Open an issue with Python about namespace modules not validating extensions properly? Dig into this a bit more. (But we'll have to continue working around it anyways)
  • Cleanup compilation API
  • Restore hyc - currently stubbed

Some decisions that have to be made still:

  • Should we support Hy code in runpy.run_path (requires monkeypatching, see _get_code_from_file) (currently leaning on no, but its implemented)
  • Should we use runpy.run_path? It means hy foo.py would work as well as hy foo.hy, but maybe again that's desirable (currently leaning on no, but its implemented)
  • Should we have our own bytecode implementation/extension (we can also leave this for later)
  • Should we drop Python 3.3 support? While the loader is compatible with Python 3.3, the correct approach to loading modules manually with importlib by filename is different (Python 3.4 made the find_spec API public). Mostly has implications for unit testing, but I think its old enough we can drop it. For some reason I thought we supported 3.3. Never mind then.

parent = os.path.join(parent, part)
try:
os.mkdir(parent)
except FileExistsError:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, Python 2 doesn't support this

@@ -280,19 +280,19 @@ def test_bin_hy_no_main():
@pytest.mark.parametrize('scenario', [
"normal", "prevent_by_force", "prevent_by_env"])
@pytest.mark.parametrize('cmd_fmt', [
'hy {fpath}', 'hy -m {modname}', "hy -c '(import {modname})'"])
'hy -m {modname}', "hy -c '(import {modname})'"])
Copy link
Contributor Author

@vodik vodik Feb 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something worth noting: python foo.py won't emit bytecode for foo.py. We now match that behaviour.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made hy foo.hy produce bytecode on purpose. The speed boost is a lot bigger for Hy than Python, so you shouldn't have to import from the file rather than running it directly in order to get bytecode.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted, I'll revert that and add documentation for it.

vodik added 3 commits March 31, 2018 15:12
Add the necessary bits to provide a Python 3.4+ compliant import
machinery. This means we now have our own PathFinder, which we register
as in sys.meta_path, our own FileFinder, and our own Loader.

Add a compatability later that builds on the Python 3 machinery to make
it best-effort backwards compatible with Python 2.
@vodik
Copy link
Contributor Author

vodik commented Mar 31, 2018

This one has been sitting on the sidelines too long, hopefully I can wrap this one up today.

Moving documentation to org and a gist, I'll convert it to rst before merging: https://gist.github.com/vodik/efe2310b8d0e5c065a1a38e582d4e580

We can't reply on hy_eval here as it doesn't properly associate the
appropriate file information.

Fixes hylang#1395
@vodik
Copy link
Contributor Author

vodik commented Apr 1, 2018

This is still coming along, I just hate writing documentation.

@asmodehn
Copy link

Hi @vodik, I was actually working on implementing a working importer for hy in palimport when I found out this PR.

I m not yet familiar with hylang, but it looks like you reimplemented an importlib... It might be necessary or not, not sure how much you want to deviate from python import semantics (namespaces packages, bytecode location and all that...) But I though you might want to know about filefinder2, which I maintain, as I need custom imports (py2/py3) for other projects. The goal is to eventually phase it out when py3 takes over, but it currently is a good base to start when writing a custom importer.

Reusing filefinder2 might help you implement importlib by relying on python3, without too much code for py2 compatibility. Anyway, Let me know if I might be of some help here.

@asmodehn
Copy link

FYI, I got a simple importer to work on a simple usecase there : https://github.com/asmodehn/palimport/pull/6/files

@vodik
Copy link
Contributor Author

vodik commented May 28, 2018

I m not yet familiar with hylang, but it looks like you reimplemented an importlib...

Hey, yeah. So most of the reimplementation is for Python 2 compatibility. Drop Python 2 and 3/4 of that codebase disappears - I can rely on importlib directly. My intention was to reuse the loader between Python 2 and 3, and then implementing the corresponding import protocols.

I might have done stuff wrong, but the loader should respect Python 2 conventions, like how submodules as supposed to be loaded recursively, while also leveraging Python 3.4+ semantics on Python 3 with the new module spec system. Bytecode generation and related stuff is there because, as far as I can tell, its not exposed in Python 2. Only in 3.

It might be necessary or not, not sure how much you want to deviate from python import semantics (namespaces packages, bytecode location and all that...)

I only deviate with namespace packages.

Maybe you can help me here, but as far as I can tell, I have to break namespace modules. This is a Python 3 bug. In theory, Hy should work just fine if I add a new Path/File finders paired with the appropriate loader. But this actually breaks Python. Here's the problematic function inside importlib dealing with package detection:

https://github.com/python/cpython/blob/master/Lib/importlib/_bootstrap_external.py#L717

def is_package(self, fullname):
    """Concrete implementation of InspectLoader.is_package by checking if
    the path returned by get_filename has a filename of '__init__.py'."""
    filename = _path_split(self.get_filename(fullname))[1]
    filename_base = filename.rsplit('.', 1)[0]
    tail_name = fullname.rpartition('.')[2]
    return filename_base == '__init__' and tail_name != '__init__'

Notice how its not actually checking for __init__.py? Python is happy to accept __init__.hy as a sign we're inside a Python package (or another extension, really). Hy code gets loaded as a Python namespace module.

The workaround is to that Hy must have its own meta loader, which needs to come first, and - to avoid the problem of then loading __init__.py as a Hy namespace module, we have to explicitly break namespace module support. I've been meaning to open a CPython bug for this. I should just do it. I'll get something prepped....

(Well, strictly speaking, meta loader isn't necessary here - but its convenient for being able to handle overlapping Py/Hy packages when different submodules are in different languages - which we rely on internally.)

But I though you might want to know about filefinder2, which I maintain, as I need custom imports (py2/py3) for other projects. The goal is to eventually phase it out when py3 takes over, but it currently is a good base to start when writing a custom importer.

So long as you do a good job of matching Python 2's expected behaviour (instead of backporting 3), I'm game for trying to merge what I've learned here back. And if there's interest in using it besides me, I don't see a problem with it then.

@vodik
Copy link
Contributor Author

vodik commented May 28, 2018

Oh right, part of my slowness here is I'm trying to draft documentation covering how all this works, the why, and the gotchas. Might as well relink it:

https://gist.github.com/vodik/efe2310b8d0e5c065a1a38e582d4e580

Feedback is welcome if I got anything obviously wrong.

But this is certainly a topic where resources as scarce and I kinda wish I found your stuff earlier. Having a single project for this probably makes a lot of sense.

@asmodehn
Copy link

I might have done stuff wrong, but the loader should respect Python 2 conventions, like how submodules as supposed to be loaded recursively,

I would be more careful about respecting python3 conventions... because the end is near.
What do you mean submodules are loaded recursively ? They should be loaded if and only if the __init__.py import them, no ? Maybe something I am not aware of (or already forgot about) regarding python2.

Bytecode generation and related stuff is there because, as far as I can tell, its not exposed in Python 2. Only in 3.

Yes, and it is good to have it (if you need it). I didn't implement that in filefinder2 because I had no usecase for it. Meaning, filefinder2 will use the one in python interpreter (2 or 3, the one that is available), I had no need to modify it yet, so no need to reimplement it there. Also I don't care much about having bytecode or not in python2. Talking about it now, I wonder if I even have test cases for it...

If hylang needs bytecode, it would be great if you can find some time to integrate that in filefinder2( with test cases !). That package is specifically for python 2/3 compatibility, and deviation from py3 will need to be implemented outside of filefinder2 (inheriting from filefinder2 where suitable). Implementing that might be quite an adventure however...

I only deviate with namespace packages.
Maybe you can help me here, but as far as I can tell, I have to break namespace modules.

I don't think so (or I don't get the problem yet...)
The code you refer to is in the Loader. Deciding for a namespace package or not, is at the PathFinder level.
Because python is happy with a __init__.hy to be a normal package and load it, we should just add this logic into the PathFinder.find_spec method. I've put more details in https://gist.github.com/vodik/efe2310b8d0e5c065a1a38e582d4e580, so let's keep discussing about it there.

So long as you do a good job of matching Python 2's expected behaviour (instead of backporting 3),

I do backport python3, to match its behavior by default, but another goal is for anyone to be able to modify (by inheritance) the behavior where needed, to :

  • minimize the cost of porting custom importer systems to (latest) python3.
  • make it simple to implement a custom importer with proper behaviors when no deviation is explicitely required
  • make it obvious where a custom importer deviates from python3 import logic
  • make it obvious in the code where importers are enabled or not (using context managers)

I'm game for trying to merge what I've learned here back. And if there's interest in using it besides me, I don't see a problem with it then.

Oh yes, please do.
I think it is beneficial in general to push all the python2 backward compatibility code to external packages (2to3 or six), I just didn't find one that was allowing to also do custom imports.
Doing that will really reduce the code complexity in hylang.

Also there are tests there to validate many usecases in all (supported) python versions, and that can help a lot not to break someone's python because of a small import statement on a python version that wasn't tested.

Also the more use cases we have for filefinder2, the cleaner it will be, and python3 import upgrades will be less messy, when we will have to finally let python2 die.

@vodik
Copy link
Contributor Author

vodik commented May 29, 2018

Okay, writing this in a rush, so hopefully I'm clear. Sorry if not.

I might have done stuff wrong, but the loader should respect Python 2 conventions, like how submodules as supposed to be loaded recursively,

I would be more careful about respecting python3 conventions... because the end is near.
What do you mean submodules are loaded recursively ? They should be loaded if and only if the init.py import them, no ? Maybe something I am not aware of (or already forgot about) regarding python2.

Did you mean respecting Python 2 conventions?

Yes, I realize the end is near. If it where up to me, we'd drop Python 2 tomorrow. But in the meanwhile, I have to support it. And, in my humble opinion, the best way to do it is to match Python 2's loader's behaviour (not backport Python 3's behaviour), and do it in such a way that a simple patch will drop code, rather than requiring rewites.

As for recursive:

If it is a package, as in the second line import pkg.module, the pkg is first imported using the steps above, and “__path__” is set on it, and later passed into find_module(“pkg.module”, path=pkg.__path__) to import pkg.module, this is called a submodule import. Submodules are imported this way recursively (As a result of importing pkg.module, the sys.modules will contain both pkg and pkg.module).

Source: https://chaobin.github.io/2015/06/22/understand-import-system-of-python/

^ Python 3 doesn't do this anymore. I forgot where this changed, its been about 3 months since I've last had the time to dig into this. I think it came with the newer module spec logic that came with 3.3/3.4.

Bytecode generation and related stuff is there because, as far as I can tell, its not exposed in Python 2. Only in 3.

Yes, and it is good to have it (if you need it). I didn't implement that in filefinder2 because I had no usecase for it. Meaning, filefinder2 will use the one in python interpreter (2 or 3, the one that is available), I had no need to modify it yet, so no need to reimplement it there. Also I don't care much about having bytecode or not in python2. Talking about it now, I wonder if I even have test cases for it...

It's not actually modified at the moment. I toyed with the idea and reverted it. But I've been meaning to propose reintroducing it. For example, if we where to embed a small header at the start of the bytecode file, we can choose to invalidate bytecode between Hy releases that would break things. For example, we recently went through a mangling scheme change. But this change meant all pre-existing bytecode files had the old names and thus broke. If we had control over the bytecode header, we could have at least forced Hy to invalidate.

I have a few other ideas here, but its not a simple topic and needs to be hashed out with everyone here, so I'm not making that change yet.

If hylang needs bytecode, it would be great if you can find some time to integrate that in filefinder2( with test cases !). That package is specifically for python 2/3 compatibility, and deviation from py3 will need to be implemented outside of filefinder2 (inheriting from filefinder2 where suitable). Implementing that might be quite an adventure however...

I only deviate with namespace packages.
Maybe you can help me here, but as far as I can tell, I have to break namespace modules.

I don't think so (or I don't get the problem yet...)
The code you refer to is in the Loader. Deciding for a namespace package or not, is at the PathFinder level. Because python is happy with a init.hy to be a normal package and load it, we should just add this logic into the PathFinder.find_spec method. I've put more details in https://gist.github.com/vodik/efjke2310b8d0e5c065a1a38e582d4e580, so let's keep discussing about it there.

Yeah, you're right. Its been a while since I last digged into it - its not there. Its here: https://github.com/python/cpython/blob/master/Lib/importlib/_bootstrap_external.py#L1388

Its not in the PathFinder, its in the FileFinder. The FileFinder looks for "__init__" + suffix and gives up if its a directory. So you can't really chain FileFinders (either within the same PathFinder or across PathFinders) without them all sharing the same the same concept of a module (denoted by __init__.py as decided by the first FileFinder). The idea that a namespace module is blankly returned here prevents the rest of the file finders from getting a chance to see if they know how to provide a loader.

Namespace module should be designated last, after all loaders give up.

If we had that behaviour, we could embed Hy as just another FileFinder in the standard sys.meta_loader loader instead of making our own PathFinder (which is really just a workaround - force Python to consider us first, then to explicitly break namespace packages, so FileFinder chaining will continue working).

This is why I think we're sitting on a Python bug. The current implementation of namespace modules violate spirit of PEP 302.

Let me devise a small example to demonstrate the problem:

import contextlib
import os
import sys
from importlib.machinery import FileFinder, SourceFileLoader

# So, this is counter intutive, but we're going to add a new *loader*,
# that we want to go *first*, before .py.
#
# Its easier to highlight the problem this way
sys.path_hooks.insert(0, FileFinder.path_hook((SourceFileLoader, [".badpy"])))

with contextlib.suppress(FileExistsError):
    os.mkdir("good")

with contextlib.suppress(FileExistsError):
    os.mkdir("bad")

with contextlib.suppress(FileExistsError):
    os.mkdir("contrast")

# We'll always be able to directly find modules if we don't do
# packages: good/demo.py -> good.demo works. We fall though our loader
# and the standard ones afterwards do their thing
#
# This works because the full filename matches, so __init__ stuff is bypassed.
# That's the fallback behaviour when an exact match isn't found
with open("good/demo.py", "w") as module:
    module.write("""print("Hello World")""")
    import good.demo
    print(good.demo)  # <module 'good.demo' from '/tmp/good/demo.badpy'>

# But, we can no longer import packages!
#
# This loads as a namespace module now. Why? Because the first loader
# (.badpy one) will now will look for bad/__init__.badpy, doesn't find it, but
# decides that since we're insider a folder, we a namespace
#
# No more chaining of loaders anymore, we shutdown too  early - even
# though the next one afterwards would be more than happy to do its thing.
with open("bad/__init__.py", "w") as module:
    module.write("""print("Hello World")""")
    import bad
    print(bad)  # HA! <module 'bad' (namespace)>

# But we can still load modules if we use `__init__.badpy`.
with open("contrast/__init__.badpy", "w") as module:
    module.write("""print("Hello World")""")
    import contrast
    print(contrast)  # <module 'contrast' from '/tmp/contrast/__init__.badpy'>

So long as you do a good job of matching Python 2's expected behaviour (instead of backporting 3),

I do backport python3, to match its behavior by default, but another goal is for anyone to be able to modify (by inheritance) the behavior where needed, to :

  • minimize the cost of porting custom importer systems to (latest) python3.
  • make it simple to implement a custom importer with proper behaviors when no deviation is explicitely required
  • make it obvious where a custom importer deviates from python3 import logic
  • make it obvious in the code where importers are enabled or not (using context managers)

Have a look at my design - most of the stuff is literally pulled out from Python 3.6 CPython's code base. The compat.py module does a good job of taking Python 3's machinery and making it Python 2 compatible. You don't actually need to compromise here and ignore Python 2.

I haven't had the chance to audit your code yet, but I am interested in digging into it.

I'm game for trying to merge what I've learned here back. And if there's interest in using it besides me, I don't see a problem with it then.

Oh yes, please do.
I think it is beneficial in general to push all the python2 backward compatibility code to external packages (2to3 or six), I just didn't find one that was allowing to also do custom imports. Doing that will really reduce the code complexity in hylang.

Well, it just shunts it around 😉

Also there are tests there to validate many usecases in all (supported) python versions, and that can help a lot not to break someone's python because of a small import statement on a python version that wasn't tested.

Also the more use cases we have for filefinder2, the cleaner it will be, and python3 import upgrades will be less messy, when we will have to finally let python2 die.

Finding the time to write test cases and documentation is actually what's I haven't found time to dig into at the moment.

But all things aside, this seems to be an archaic topic, with barely any good comprehensive resources, and ended up having to dig into CPython 2.7/3.6 source code to understand how the systems worked. Having a high quality, well documented library would be a boon.

@vodik
Copy link
Contributor Author

vodik commented May 29, 2018

Please, double check my work. Now that I put the effort together to write a concrete example, I should be in good shape to open a Python issue.

Its been three months since I really dug deep into this and I was already finding myself rediscovering the importlib mechanisms to explain this.

@asmodehn
Copy link

asmodehn commented Jun 2, 2018

Yes, I realize the end is near. If it where up to me, we'd drop Python 2 tomorrow. But in the meanwhile, I have to support it. And, in my humble opinion, the best way to do it is to match Python 2's loader's behaviour (not backport Python 3's behaviour), and do it in such a way that a simple patch will drop code, rather than requiring rewrites.

Hehe, that can be a very long debate. In my case I would think that having the user explicitly import in his package __init__ is a good thing anyway. But if you want to recursively import everything recursively, I would think implementing a Py2RecursivePathFinder for this specific behaviour would be simple enough.
By the way, I have no problem if you want to send a PR to filefinder2 for this, as a Py2RecursivePathFinder that the user can chose (or not) to use...

Yeah, you're right. Its been a while since I last digged into it - its not there. Its here: https://github.com/python/cpython/blob/master/Lib/importlib/_bootstrap_external.py#L1388

Its not in the PathFinder, its in the FileFinder. The FileFinder looks for "init" + suffix and gives up if its a directory. So you can't really chain FileFinders (either within the same PathFinder or across PathFinders) without them all sharing the same the same concept of a module (denoted by init.py as decided by the first FileFinder). The idea that a namespace module is blankly returned here prevents the rest of the file finders from getting a chance to see if they know how to provide a loader.

Namespace module should be designated last, after all loaders give up.

Ah I get your point. I also had a problem with this, and I agree with you in the sense that it took me by surprise at first. I wasn't expecting this design at all. But I guess I come at it from a slightly different angle. My problem is that I want to add (and potentially remove) a CustomFileFinder :

  • Inserting it in the sys.path_hook list after Python's FileFinder is not working ( Python's FileFinder decides it's a namespace path before it gets to my CustomFileFinder).
  • Inserting it in the sys.path_hook list before Python's FileFinder means I need to first find the filefinder hook in the list (which feels like a horrible hack).

So I had to go with option2 : when I implement a custom importer, I manage the "hierarchy/subfolder" logic in CustomPathFinder (if it is any different that Python3's one), and I manage the file logic in the CustomFileFinder, but I have to insert it before Python's FileFinder (like so : https://github.com/asmodehn/palimport/pull/6/files#diff-98dc28c5586b3a1c1d9b76f9aa62f134R22).

I also override find_spec in the CustomFileFinder, but do not deal with namespace package (assuming they will be managed by Python's FileFinder, when it gets to that path_hook, since I usually want the python3 behaviour). See https://github.com/asmodehn/palimport/pull/6/files#diff-6a4fa10e912a1a2253de026b1e3ba7cbR54 for an example.

Note in this simple hylang importer implementation I did not handle the (namespace or not) package case, as I wasn't sure of the desired behaviour. It might be that the inconvenient implementation (for my usecase) of python importlib right now is a major hurdle for your usecase...But I did change before the namespace package semantic for an importer (rosimport), as it was really necessary. But it turned out to be so unexpected and tricky to follow, that I vowed to never do it again, if I can help it ;-). I find that Python3 has a sensible subdirectory import semantic, when it comes to modular scripting language anyway.

Looking deeper into the code I found out this point also :

  • the PathFinder find_spec implementation defer to the FileFinder to make it a namespace package.
  • the PathFinder find_module implementation directly instantiate a NamespaceLoader.

I would (naively - can't remember if I ever tested it) assume that :

  • if you want a python2-like behaviour for modules, you should use the python2 API (PEP302), that is find_module.
  • if you want a the python3.4+-like behaviour (meaning PEP451) then you use find_spec.

Finding the time to write test cases and documentation is actually what's I haven't found time to dig into at the moment.

Feel free to take or reuse all the tests in filefinder2. You will also need to box (pytest-forked) your tests, so the state of your interpreter doesn't get polluted by previous tests. One of the problem here is that the volume of files and folder grows quickly, which is another reason why I decided to have that in a separate package.

Having a high quality, well documented library would be a boon.

Oh yes. sadly, documentation is not really my forte either...

I will now have a detail look at the code you sent here, and run it in a few different way. I ll soon post some kind of report on the gist comments.

@brandonwillard
Copy link
Member

@vodik, I just copied this branch and made some updates for those conflicting files: https://github.com/brandonwillard/hy/tree/new-importer

@Kodiologist
Copy link
Member

@brandonwillard vodik hasn't been active on Hy lately, so if you want to work on these changes and see them happen, I'd recommend opening a new PR.

@brandonwillard
Copy link
Member

I really do want to see this go through, so I'll do that.

@brandonwillard brandonwillard mentioned this pull request Aug 14, 2018
@Kodiologist
Copy link
Member

I think this was superseded by #1672.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants